The first part of this Rmd contains notes and exercises from the lecture series on the book Intro to Statistical Learning. The second part of this Rmd document contains notes and examples from the book itself.

***
Starting Point
On the basis of the training data we would like to:
* Accurately predict unseen test cases.
* Understand which inputs affect the outcome, and how.
* Assess the quality of our predictions and inferences.
What is Statistical Learning?
Expected value is a fancy term for the average value.
Example of statistical learning: Can we predict Sales using these three variables?
\[ Sales \approx f(TV,Radio,Newspaper) \]
Here Sales is a response or target that we wish to predict. We generically refer to the response as \(Y\). TV is a feature, or input, or predictor, we name it \(X_{1}\). Likewise name Radio as \(X_{2}\), and so on. We can refer to the input vector collectively as:
\[ X = \begin{pmatrix} X_{1} \\ X_{2} \\ X_{3} \end{pmatrix} \] Now we write our model as:
\[ Y = f(X) + \epsilon \]
where \(\epsilon\) captures measurements errors and other discrepancies.
Is there an ideal \(f(X)\)? In particular, what is a good value for \(f(X)\) at any selected value of \(X\), say \(X\) = 4? There can be many \(Y\) values at \(X\) = 4. A good value is:
\[ f(4) = E(Y\mid X = 4) \]
\(E(Y\mid X = 4)\) means expected value (average) of \(Y\) given \(X\) =4.
This ideal \(f(x) = E(Y\mid X = x)\) is called the regression function.
This is called nearest neighbors or local averaging: we relax the definition to allow averaging over nearby points. This compensates for the fact that there may be no observed \(y\) at any particular value of \(x\).
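A minimal sketch of this idea in R, on simulated data (the neighborhood width `h` and the helper name `local_average` are illustration choices, not from the notes):

```r
# Estimate f(4) = E(Y | X = 4) by averaging the y-values of training
# points whose x lies within a small window around x0 = 4.
set.seed(1)
x <- runif(500, 0, 8)
y <- sin(x) + rnorm(500, sd = 0.3)   # true f(x) = sin(x)

local_average <- function(x0, x, y, h = 0.25) {
  mean(y[abs(x - x0) <= h])          # average over the neighborhood of x0
}

local_average(4, x, y)               # should land near sin(4) = -0.757
```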
The linear model is an important example of a parametric model:
\[ f_{L}(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p} \]
* A linear model is specified in terms of \(p + 1\) parameters \(\beta_{0},\beta_{1},...,\beta_{p}\).
* We estimate the parameters by fitting the model to training data.
* Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function \(f(X)\).
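A quick sketch of fitting such a parametric model in R with `lm()`, using simulated data where the true coefficients are known (the data here are made up for illustration):

```r
# Fit f_L(X) = beta0 + beta1*X1 + beta2*X2 by least squares and
# inspect the p + 1 = 3 estimated parameters.
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(100)  # true (beta0, beta1, beta2) = (2, 3, -1)

fit <- lm(y ~ x1 + x2)
coef(fit)                               # estimates should be near (2, 3, -1)
```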
Suppose we fit a model \(\hat f(x)\) to some training data \(Tr=\{x_{i},y_{i}\}_{1}^N\), and we wish to see how well it performs.
* We could compute the average squared prediction error over Tr:
\[ MSE_{Tr} = Ave_{i \in Tr}[y_{i} - \hat f(x_{i})]^2 \]
This may be biased toward more overfit models.
* Instead we should, if possible, compute it using fresh data \(Te = \{x_{i},y_{i}\}_{1}^M\):
\[ MSE_{Te} = Ave_{i \in Te}[y_{i} - \hat f(x_{i})]^2 \]
Suppose we have fit a model \(\hat f(x)\) to some training data Tr, and let \((x_{0},y_{0})\) be a test observation drawn from the population. If the true model is \(Y = f(X) + \epsilon\) (with \(f(x) = E(Y\mid X = x)\)), then
\[ E(y_{0} - \hat f(x_{0}))^2 = Var(\hat f(x_{0})) + [Bias(\hat f(x_{0}))]^2 + Var(\epsilon) \]
The expectation averages over the variability of \(y_{0}\), as well as the variability in Tr. Note that \(Bias(\hat f(x_{0})) = E[\hat f(x_{0})] - f(x_{0})\).
Typically as the flexibility of \(\hat f\) increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
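The trade-off can be seen in a small simulation (illustrative only; polynomial degree stands in for "flexibility"):

```r
# Training MSE falls monotonically with flexibility; test MSE on fresh
# data eventually turns back up as the fit starts chasing noise.
set.seed(1)
n  <- 100
x  <- runif(n, -2, 2); y  <- sin(2 * x)  + rnorm(n, sd = 0.3)  # training set
xt <- runif(n, -2, 2); yt <- sin(2 * xt) + rnorm(n, sd = 0.3)  # fresh test set

for (d in c(1, 3, 10, 20)) {
  fit    <- lm(y ~ poly(x, d))
  mse_tr <- mean((y  - predict(fit))^2)
  mse_te <- mean((yt - predict(fit, newdata = data.frame(x = xt)))^2)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", d, mse_tr, mse_te))
}
```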
Here the response variable \(Y\) is qualitative - e.g. email is one of \(C = \{spam, ham\}\) (ham = good email), digit class is one of \(C = \{0,1,...,9\}\). Our goals are to:
* Build a classifier \(C(X)\) that assigns a class label from \(C\) to a future unlabeled observation \(X\).
* Assess the uncertainty in each classification
* Understand the roles of the different predictors among \(X = (X_{1},X_{2},...,X_{p})\).
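The console output below appears without its source chunks; judging from the printed values, the commands were most likely along these lines (a reconstruction, not the original code):

```r
# Basic vectors: create, combine, and subset (reconstructed from the output).
x <- c(2, 7, 5)
x                               # 2 7 5
y <- seq(from = 4, length = 3, by = 3)
y                               # 4 7 10
x + y                           # 6 14 15
x / y                           # 0.5 1.0 0.5
x ^ y                           # 16 823543 9765625
x[2]                            # 7
x[2:3]                          # 7 5
x[-2]                           # 2 5  (negative index drops elements)
x[-c(1, 2)]                     # 5

# Matrices: filled column-wise; then subset rows and columns.
z <- matrix(seq(1, 12), 4, 3)
z[3:4, 2:3]
z[, 2:3]
z[, 1]                          # drops to a vector: 1 2 3 4
z[, 1, drop = FALSE]            # stays a 4 x 1 matrix
dim(z)                          # 4 3
```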
## [1] 2 7 5
## [1] 4 7 10
## [1] 6 14 15
## [1] 0.5 1.0 0.5
## [1] 16 823543 9765625
## [1] 7
## [1] 7 5
## [1] 2 5
## [1] 5
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
## [,1] [,2]
## [1,] 7 11
## [2,] 8 12
## [,1] [,2]
## [1,] 5 9
## [2,] 6 10
## [3,] 7 11
## [4,] 8 12
## [1] 1 2 3 4
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [1] 4 3
## [1] "alpha" "alpha.fn" "Auto"
## [4] "best.fit" "boot.out" "Carseats"
## [7] "coefi" "cv.error" "cv.error10"
## [10] "cv.errors" "cv.lasso" "cv.ridge"
## [13] "d" "dat" "degree"
## [16] "diamond_samp" "diamonds" "diamondsbig"
## [19] "diamondsBigGia" "diamondsurl" "Direction.2005"
## [22] "fit" "fit.lasso" "fit.ridge"
## [25] "fit1" "fit2" "fit3"
## [28] "fit4" "fit5" "fit6"
## [31] "fit7" "folds" "galton"
## [34] "glm.fit" "glm.pred" "glm.probs"
## [37] "hisDiamond" "Hitters" "i"
## [40] "k" "knn.pred" "lam.best"
## [43] "lasso.tr" "lda.fit" "lda.pred"
## [46] "lm.fit" "lm.fit1" "lm.fit2"
## [49] "lm.fit5" "LoadLibraries" "loocv"
## [52] "m1.1" "m2.1" "m3.1"
## [55] "m4.1" "m5.1" "modelEstimate"
## [58] "new.Xy" "pred" "predict.regsubsets"
## [61] "reg.summary" "regfit.full" "regfit.fwd"
## [64] "regplot" "rmse" "rmse.cv"
## [67] "Smarket.2005" "thisDiamond" "train"
## [70] "val.errors" "x" "x.test"
## [73] "XLag" "Xy" "y"
## [76] "z"
## [1] "alpha" "alpha.fn" "Auto"
## [4] "best.fit" "boot.out" "Carseats"
## [7] "coefi" "cv.error" "cv.error10"
## [10] "cv.errors" "cv.lasso" "cv.ridge"
## [13] "d" "dat" "degree"
## [16] "diamond_samp" "diamonds" "diamondsbig"
## [19] "diamondsBigGia" "diamondsurl" "Direction.2005"
## [22] "fit" "fit.lasso" "fit.ridge"
## [25] "fit1" "fit2" "fit3"
## [28] "fit4" "fit5" "fit6"
## [31] "fit7" "folds" "galton"
## [34] "glm.fit" "glm.pred" "glm.probs"
## [37] "hisDiamond" "Hitters" "i"
## [40] "k" "knn.pred" "lam.best"
## [43] "lasso.tr" "lda.fit" "lda.pred"
## [46] "lm.fit" "lm.fit1" "lm.fit2"
## [49] "lm.fit5" "LoadLibraries" "loocv"
## [52] "m1.1" "m2.1" "m3.1"
## [55] "m4.1" "m5.1" "modelEstimate"
## [58] "new.Xy" "pred" "predict.regsubsets"
## [61] "reg.summary" "regfit.full" "regfit.fwd"
## [64] "regplot" "rmse" "rmse.cv"
## [67] "Smarket.2005" "thisDiamond" "train"
## [70] "val.errors" "x" "x.test"
## [73] "XLag" "Xy" "z"
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
## [1] 32 11
## [1] "data.frame"
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
## [1] ".GlobalEnv" "Auto" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
Consider the advertising data shown on the next slide.
Questions we might ask:
* Is there a relationship between advertising budget and sales?
* How strong is the relationship between advertising budget and sales?
* Which media contribute to sales?
* How accurately can we predict future sales?
* Is the relationship linear?
* Is there synergy among the advertising media?
That is, there is approximately a 95% chance that the interval
\[ [ \hat\beta_{1} -2 \cdot SE(\hat\beta_{1}), \hat\beta_{1} + 2 \cdot SE(\hat\beta_{1})] \]
will contain the true value of \(\beta_{1}\) (under a scenario where we got repeated samples like the present sample)
For the advertising data, the 95% confidence interval for \(\beta_{1}\) is [0.042,0.05]
Hypothesis Testing
* Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis of
* \(H_{0}:\) There is no relationship between \(X\) and \(Y\) versus the alternative hypothesis
* \(H_{A}:\) There is some relationship between \(X\) and \(Y\)
We compute the Residual Standard Error
\[ RSE = \sqrt{\frac{1}{n -2}RSS} = \sqrt{\frac{1}{n -2}\sum_{i=1}^n (y_{i} - \hat y_{i})^2} \], where the residual sum-of-squares is \(RSS = \sum_{i=1}^n (y_{i}-\hat y_{i})^2\).
It can be shown that in the simple linear regression setting, \(R^2 = r^2\), where \(r\) is the correlation between \(X\) and \(Y\):
\[ r = \frac{\sum_{i=1}^n (x_{i} - \bar x)(y_{i} - \bar y)}{\sqrt{\sum_{i=1}^n (x_{i} - \bar x)^2 \sum_{i=1}^n (y_{i} - \bar y)^2}} \]
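Both this identity and the RSE formula above are easy to verify numerically in R (simulated data; `summary(fit)$sigma` is R's residual standard error):

```r
# Check that R^2 equals the squared correlation in simple linear
# regression, and that summary()'s sigma matches the RSE formula.
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

r2  <- summary(fit)$r.squared
rse <- summary(fit)$sigma

all.equal(r2, cor(x, y)^2)                          # TRUE
all.equal(rse, sqrt(sum(resid(fit)^2) / (50 - 2)))  # TRUE
```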
\[ sales = \beta_{0} + \beta_{1} \times TV + \beta_{2} \times radio + \beta_{3} \times newspaper + \epsilon \]
"Data Analysis and Regression" Mosteller and Tukey 1977
* A regression coefficient \(\hat\beta_{j}\) estimates the expected change in \(Y\) per unit change in \(X_{j}\), with all other predictors held fixed. But predictors usually change together!
* Example: \(Y\) = total amount of change in your pocket; \(X_{1}\) = # of coins; \(X_{2}\) = # of pennies, nickels and dimes. By itself, the regression coefficient of \(Y\) on \(X_{2}\) will be > 0. But how about with \(X_{1}\) in the model?
* Example 2: \(Y\) = number of tackles by a football player in a season; \(W\) and \(H\) are his weight and height. Fitted regression model is \(\hat Y = \beta_{0} + .50W - .10H\). How do we interpret \(\hat\beta_{2} < 0\)?
"Essentially, all models are wrong, but some are useful." - George Box
"The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively."
- Fred Mosteller and John Tukey
For the first question, we can use the F statistic
\[ F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p,n-p-1} \]
Example:

Quantity | Value
---------|----------
Residual Standard Error | 1.69
\(R^2\) | 0.897
F-Statistic | 570
Qualitative Predictors
* Some predictors are not quantitative but are qualitative, taking a discrete set of values.
* These are also called categorical predictors or factor variables.
Example: investigate differences in credit card balances between males and females, ignoring the other variables. We create a new variable
\[ x_{i} = \begin{cases} 1, &\text{if i-th person is female} \\ 0, &\text{if i-th person is male} \end{cases} \]
Resulting model:
\[ y_{i}=\beta_{0} + \beta_{1}x_{i} + \epsilon_{i} = \begin{cases} \beta_{0} + \epsilon_{i}, &\text{if i-th person is female} \\ \beta_{0} + \beta_{1} + \epsilon_{i}, &\text{if i-th person is male} \end{cases} \]
Interpretation?
Removing the additive assumption: interactions and nonlinearity
Interactions:
* In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.
* For example, the linear model \[ \widehat{sales} = \beta_{0} + \beta_{1} \times TV + \beta_{2} \times radio + \beta_{3} \times newspaper \] states that the average effect on sales of a one-unit increase in TV is always \(\beta_{1}\), regardless of the amount spent on radio.
Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and Student (qualitative). \[ balance_{i} \approx \beta_{0} + \beta_{1} \times income_{i} + \begin{cases} \beta_{2}, &\text{if i-th person is a student} \\ 0, &\text{if i-th person is not a student} \end{cases} \]
\[ = \beta_{1} \times income_{i} + \begin{cases} \beta_{0} + \beta_{2}, &\text{if i-th person is a student} \\ \beta_{0}, &\text{if i-th person is not a student} \end{cases} \]
With interactions, it takes the form:
\[ balance_{i} \approx \beta_{0} + \beta_{1} \times income_{i} + \begin{cases} \beta_{2} + \beta_{3} \times income_{i}, &\text{if student} \\ 0, &\text{if not a student} \end{cases} \]
\[ =\begin{cases} (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3}) \times income_{i}, &\text{if student} \\ \beta_{0} + \beta_{1} \times income_{i}, &\text{if i-th person is not a student} \end{cases} \]
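In R the two models differ only in `+` versus `*` in the formula. A sketch on the ISLR `Credit` data (assumes the `ISLR` package is installed, with column names `Balance`, `Income`, `Student` as in that package):

```r
library(ISLR)  # provides the Credit data set

fit_add <- lm(Balance ~ Income + Student, data = Credit)  # parallel lines
fit_int <- lm(Balance ~ Income * Student, data = Credit)  # separate slopes
coef(fit_int)
# The Income:StudentYes term is beta_3: the change in the income slope
# for students relative to non-students.
```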
In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit:
* Classification problems: logistic regression, support vector machines
* Non-linearity: kernel smoothing, splines and generalized additive models, nearest neighbor methods.
* Interactions: Tree-based methods, bagging, random forests and boosting (these capture non-linearities)
* Regularized fitting: Ridge regression and lasso
## Warning: package 'MASS' was built under R version 3.1.3
##
## Attaching package: 'ISLR'
##
## The following objects are masked _by_ '.GlobalEnv':
##
## Auto, Hitters
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75
## Median :272.0 Median :117.0 Medium:219 Median :54.50
## Mean :264.8 Mean :115.8 Mean :53.32
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00
## Max. :509.0 Max. :191.0 Max. :80.00
## Education Urban US
## Min. :10.0 No :118 No :142
## 1st Qu.:12.0 Yes:282 Yes:258
## Median :14.0
## Mean :13.9
## 3rd Qu.:16.0
## Max. :18.0
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
Classification
* Qualitative variables take values in an unordered set \(C\), such as: \[ \text{eye color} \in \{brown, blue, green\} \] \[ \text{email} \in \{spam, ham\} \]
* Given a feature vector \(X\) and a qualitative response \(Y\) taking values in the set \(C\), the classification task is to build a function \(C(X)\) that takes as input the feature vector \(X\) and predicts its value for \(Y\); i.e. \(C(X) \in C\).
* Often we are more interested in estimating the probabilities that \(X\) belongs to each category in \(C\).
* For example, it's more valuable to have an estimate of the probability that an insurance claim is fraudulent, than a classification fraudulent or not.
Suppose for the Default classification task we code
\[ Y = \begin{cases} 0, &\text{if NO} \\ 1, &\text{if YES} \end{cases} \]
Can we simply perform a linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat Y > 0.5\)?
Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.
\[ Y = \begin{cases} 1, &\text{if stroke} \\ 2, &\text{if drug overdose} \\ 3, &\text{if epileptic seizure} \end{cases} \]
This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.
Let's write \(p(X) = Pr(Y = 1 \mid X)\) for short and consider using balance to predict default. Logistic regression uses the form
\[ p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \]
(\(e \approx 2.71828\) is a mathematical constant [Euler's number.]) It is easy to see that no matter what values \(\beta_{0},\beta_{1}\) or \(X\) take, \(p(X)\) will have values between 0 and 1.
A bit of rearrangement gives \[log(\frac{p(x)}{1 - p(x)}) = \beta_{0} + \beta_{1}X \]
This monotone transformation is called the log odds or logit transformation of \(p(x)\).
We use maximum likelihood to estimate the parameters.
\[ \ell(\beta_{0},\beta_{1}) = \prod_{i:y_{i}=1}p(x_{i}) \prod_{i:y_{i}=0} (1 - p(x_{i})) \]
This likelihood gives the probability of the observed zeros and ones in the data. We pick \(\beta_{0}\) and \(\beta_{1}\) to maximize the likelihood of the observed data.
Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function.
What is our estimated probability of default for someone with a balance of $1000?
\[ \hat p(X) = \frac{e^{\hat\beta_{0}+\hat\beta_{1}X}}{1 + e^{\hat\beta_{0}+\hat\beta_{1}X}} = \frac{e^{-10.6513+0.0055\times1000}}{1 + e^{-10.6513+0.0055\times1000}} = 0.006 \]
With a balance of $2000?
\[ \hat p(X) = \frac{e^{\hat\beta_{0}+\hat\beta_{1}X}}{1 + e^{\hat\beta_{0}+\hat\beta_{1}X}} = \frac{e^{-10.6513+0.0055\times2000}}{1 + e^{-10.6513+0.0055\times2000}} = 0.586 \]
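Both numbers can be checked in R: the fitted probability is exactly `plogis(b0 + b1 * x)`, since `plogis` is the logistic function \(e^t/(1+e^t)\):

```r
b0 <- -10.6513   # estimated intercept from the Default fit above
b1 <-   0.0055   # estimated slope for balance

plogis(b0 + b1 * 1000)   # ~0.006, as above
plogis(b0 + b1 * 2000)   # ~0.586
```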
Let's do it again, using Student as the predictor.
\[\widehat{Pr}(default=Yes\mid student=Yes) = \frac{e^{-3.5041+0.4049\times1}}{1+e^{-3.5041+0.4049\times1}} = 0.0431\] \[\widehat{Pr}(default=Yes\mid student=No) = \frac{e^{-3.5041+0.4049\times0}}{1+e^{-3.5041+0.4049\times0}} = 0.0292\]
Logistic regression with several variables
\[ log(\frac{p(X)}{1-p(X)}) = \beta_{0} + \beta_{1}X_{1}+...+\beta_{p}X_{p} \]
\[ p(X) = \frac{e^{\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}}}{1 + e^{\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}}} \]
So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form
\[ Pr(Y = k \mid X) = \frac{e^{\beta_{0k}+\beta_{1k}X_{1}+...+\beta_{pk}X_{p}}}{\sum_{l=1}^K e^{\beta_{0l}+\beta_{1l}X_{1}+...+\beta_{pl}X_{p}}} \]
Here there is a linear function for each class.
Multiclass logistic regression is also referred to as multinomial regression.
Here the approach is to model the distribution of \(X\) in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(Pr(Y\mid X)\).
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem:
\[Pr(Y = k \mid X = x) = \frac{Pr(X = x \mid Y = k) \cdot Pr(Y = k)}{Pr(X = x)} \]
One writes this slightly differently for discriminant analysis:
\[ Pr(Y = k \mid X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^K \pi_{l}f_{l}(x)} \], where * \(f_{k}(x) = Pr(X = x \mid Y = k)\) is the density for \(X\) in class \(k\). Here we will use normal densities for these, separately in each class.
* \(\pi_{k} = Pr(Y = k)\) is the marginal or prior probability for class k.
We classify a new point according to which density is highest.
When the priors are different, we take them into account as well, and compare \(\pi_{k}f_{k}(x)\). On the right, we favor the pink class - the decision boundary has shifted to the left.
The Gaussian density has the form
\[ f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}e^{-\frac{1}{2}(\frac{x - \mu_{k}}{\sigma_{k}})^2} \]
Here \(\mu_{k}\) is the mean, and \(\sigma_{k}^2\) the variance (in class \(k\)). We will assume that all the \(\sigma_{k} = \sigma\) are the same.
Plugging this into Bayes formula, we get a rather complex expression for \(p_{k}(x) = Pr(Y = k \mid X = x)\):
\[ p_{k}(x)=\frac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu_{k}}{\sigma})^2}}{\sum_{l=1}^K \pi_{l}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x - \mu_{l}}{\sigma})^2}} \]
Happily, there are simplifications and cancellations.
To classify at the value \(X = x\), we need to see which of the \(p_{k}(x)\) is largest. Taking logs, and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score:
\[ \delta_{k}(x) = x \cdot \frac{\mu_{k}}{\sigma^2} - \frac{\mu_{k}^2}{2\sigma^2} + log(\pi_{k}) \]
Note that \(\delta_{k}(x)\) is a linear function of x.
If there are \(K =2\) classes and \(\pi_{1} = \pi_{2} = 0.5\), then one can see that the decision boundary is at
\[ x = \frac{\mu_{1} + \mu_{2}}{2} \]
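A tiny numerical check of this boundary, coding the discriminant score directly (the class means and variance are made-up illustration values):

```r
# delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
delta <- function(x, mu, sigma2, pi_k) {
  x * mu / sigma2 - mu^2 / (2 * sigma2) + log(pi_k)
}

mu1 <- -1.25; mu2 <- 1.25; s2 <- 1
x0  <- (mu1 + mu2) / 2                              # midpoint = 0

delta(x0, mu1, s2, 0.5) == delta(x0, mu2, s2, 0.5)  # TRUE: the scores tie here
```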
\[ \hat\pi_{k} = \frac{n_{k}}{n} \]
\[ \hat\mu_{k} = \frac{1}{n_{k}}\sum_{i:y_{i}=k} x_{i} \]
\[ \hat\sigma^2 = \frac{1}{n - K}\sum_{k=1}^K\sum_{i:y_{i}=k}(x_{i}-\hat\mu_{k})^2 \]
\[ = \sum_{k=1}^K\frac{n_{k} -1}{n - K} \cdot \hat\sigma_{k}^2 \]
where \(\hat\sigma_{k}^2 = \frac{1}{n_{k}-1}\sum_{i:y_{i}=k}(x_{i}-\hat\mu_{k})^2\) is the usual formula for the estimated variance in the \(k\)th class.
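These plug-in estimates are one-liners in R (simulated one-predictor data for illustration; the variable names are made up):

```r
# Estimate the priors, class means, and pooled variance for LDA.
set.seed(1)
yk <- rep(1:2, times = c(60, 40))            # class labels
x  <- rnorm(100, mean = c(-1, 1)[yk])        # one predictor, class-shifted means

n <- length(x); K <- 2
pi_hat <- table(yk) / n                      # n_k / n
mu_hat <- tapply(x, yk, mean)                # class means
s2_hat <- sum((x - mu_hat[yk])^2) / (n - K)  # pooled variance estimate

pi_hat; mu_hat; s2_hat
```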
Density: \(f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)}\)
Discriminant function: \(\delta_{k}(x) = x^T\Sigma^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^T\Sigma^{-1}\mu_{k}+log(\pi_{k})\)
It looks complex, but the important thing to note is that it is, again, linear in \(x\): \(x\) is multiplied by a coefficient vector (\(\Sigma^{-1}\mu_{k}\)), and the remaining terms (\(-\frac{1}{2}\mu_{k}^T\Sigma^{-1}\mu_{k}+log(\pi_{k})\)) are all constants. So this, again, is a linear function.
Despite its complex form,
\(\delta_{k}(x)= c_{k0}+c_{k1}x_{1}+...+c_{kp}x_{p}\) - a linear function.
When there are \(K\) classes, linear discriminant analysis can be viewed exactly in a \(K-1\) dimensional plot. Why? Because it essentially classifies to the closest centroid, and the centroids span a \(K-1\) dimensional plane. Even when \(K > 3\), we can find the 'best' 2-dimensional plane for visualizing the discriminant rule.
Once we have estimates \(\hat\delta_{k}(x)\), we can turn these into estimates for class probabilities:
\[ \widehat{Pr}(Y = k\mid X = x) = \frac{e^{\hat\delta_{k}(x)}}{\sum_{l=1}^K e^{\hat\delta_{l}(x)}} \]
So classifying to the largest \(\hat\delta_{k}(x)\) amounts to classifying to the class for which \(\widehat{Pr}(Y = k \mid X = x)\) is largest.
When \(K = 2\), we classify to class 2 if \(\widehat{Pr}(Y = 2 \mid X = x) \geq 0.5\) else to class 1.
(23 + 252)/10000 errors -- a 2.75% misclassification rate!
Some Caveats:
* This is training error, and we may be overfitting. Not a big concern here since n = 10000, and p = 4!
* If we classified to the prior -- always to class No in this case -- we would make 333/10000 errors, or only 3.33%. This is called the null rate or the naive classifier; our model has to at least outperform this rate, or we are wasting our time.
* Of the true No's, we make 23/9667 = 0.2% errors; of the true Yes's, we make 252/333 = 75.7% errors!
False positive rate: The fraction of negative examples that are classified as positive -- 0.2% in the example.
False negative rate: The fraction of positive examples that are classified as negative -- 75.7% in the example.
We produced this table by classifying to class Yes if
\[ \widehat{Pr}(Default=Yes\mid Balance,Student) \geq 0.5\]
We can change the two error rates by changing the threshold from 0.5 to some other value in [0,1]:
\[ \widehat{Pr}(Default = Yes \mid Balance, Student) \geq threshold \]
and vary threshold.
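A sketch of that threshold sweep in R, on simulated probabilities (the Default data itself is not loaded here, so the numbers are illustrative only):

```r
# Vary the threshold and watch the two error rates trade off:
# lowering it catches more positives (lower FNR) at the cost of FPR.
set.seed(1)
p_hat <- runif(1000)               # stand-in for fitted Pr(Yes | X)
truth <- rbinom(1000, 1, p_hat)    # simulated true labels

for (thr in c(0.1, 0.25, 0.5)) {
  pred <- as.integer(p_hat >= thr)
  fpr  <- mean(pred[truth == 0] == 1)   # false positive rate
  fnr  <- mean(pred[truth == 1] == 0)   # false negative rate
  cat(sprintf("threshold %.2f: FPR %.3f, FNR %.3f\n", thr, fpr, fnr))
}
```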
The ROC plot displays both simultaneously.
Sometimes we use the AUC, or area under the curve, to summarize the overall performance. Higher AUC is better.
\[ Pr(Y = k \mid X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^K \pi_{l}f_{l}(x)} \]
When \(f_{k}(x)\) are Gaussian densities, with the same covariance matrix \(\Sigma\), this leads to linear discriminant analysis. By altering the forms for \(f_{k}(x)\), we get different classifiers.
* With Gaussians but different \(\Sigma_{k}\) in each class, we get quadratic discriminant analysis.
* With \(f_{k}(x) = \prod_{j=1}^p f_{jk}(x_{j})\) (conditional independence model) in each class we get naive Bayes. In the Gaussian case this means the \(\Sigma_{k}\) are diagonal.
* Many other forms, by proposing specific density models for \(f_{k}(x)\), including nonparametric approaches.
\[ \delta_{k}(x) = -\frac{1}{2}(x - \mu_{k})^T \Sigma_{k}^{-1} (x - \mu_{k}) + log(\pi_{k}) - \frac{1}{2}log|\Sigma_{k}| \]
Because the \(\Sigma_{k}\) are different, the quadratic terms matter.
Assumes features are independent in each class. Useful when p is large, and so multivariate methods like QDA and even LDA break down.
* Gaussian naive Bayes assumes each \(\Sigma_{k}\) is diagonal:
\[ \delta_{k}(x) \propto log\bigg[ \pi_{k}\prod_{j=1}^p f_{kj}(x_{j})\bigg] \] \[ = -\frac{1}{2}\sum_{j=1}^p\Big[\frac{(x_{j}-\mu_{kj})^2}{\sigma_{kj}^2} + log\,\sigma_{kj}^2\Big] + log\,\pi_{k} \]
* Can be used for mixed feature vectors (qualitative and quantitative). If \(X_{j}\) is qualitative, replace \(f_{kj}(x_{j})\) with the probability mass function (histogram) over the discrete categories.
Despite strong assumptions, naive Bayes often produces good classification results.
For a two-class problem, one can show that for LDA
\[ log\big( \frac{p_{1}(x)}{1 - p_{1}(x)}\big) = log\big(\frac{p_{1}(x)}{p_{2}(x)}\big) = c_{0} + c_{1}x_{1} +...+ c_{p}x_{p} \]
So it has the same form as logistic regression.
The difference is in how the parameters are estimated.
* Logistic regression uses the conditional likelihood based on \(Pr(Y\mid X)\) (known as discriminative learning).
* LDA uses the full likelihood based on \(Pr(X,Y)\) (known as generative learning).
* Despite these differences, in practice the results are often very similar.
Footnote: logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
## Year Lag1 Lag2
## Min. :2001 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500
## Median :2003 Median : 0.039000 Median : 0.039000
## Mean :2003 Mean : 0.003834 Mean : 0.003919
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000
## Lag3 Lag4 Lag5
## Min. :-4.922000 Min. :-4.922000 Min. :-4.92200
## 1st Qu.:-0.640000 1st Qu.:-0.640000 1st Qu.:-0.64000
## Median : 0.038500 Median : 0.038500 Median : 0.03850
## Mean : 0.001716 Mean : 0.001636 Mean : 0.00561
## 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.59700
## Max. : 5.733000 Max. : 5.733000 Max. : 5.73300
## Volume Today Direction
## Min. :0.3561 Min. :-4.922000 Down:602
## 1st Qu.:1.2574 1st Qu.:-0.639500 Up :648
## Median :1.4229 Median : 0.038500
## Mean :1.4783 Mean : 0.003138
## 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. :3.1525 Max. : 5.733000
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
## 1 2 3 4 5
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812
## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507
## [1] 0.5216
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
## [1] 0.4801587
## Direction.2005
## glm.pred Down Up
## Down 35 35
## Up 76 106
## [1] 0.5595238
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = Year <
## 2005)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
## [1] "list"
## class posterior.Down posterior.Up LD1
## 999 Up 0.4901792 0.5098208 0.08293096
## 1000 Up 0.4792185 0.5207815 0.59114102
## 1001 Up 0.4668185 0.5331815 1.16723063
## 1002 Up 0.4740011 0.5259989 0.83335022
## 1003 Up 0.4927877 0.5072123 -0.03792892
##
## Down Up
## Down 35 35
## Up 76 106
## [1] 0.5595238
## Warning: package 'class' was built under R version 3.1.2
## The following objects are masked from Smarket (position 4):
##
## Direction, Lag1, Lag2, Lag3, Lag4, Lag5, Today, Volume, Year
## Lag1 Lag2
## [1,] 0.381 -0.192
## [2,] 0.959 0.381
## [3,] 1.032 0.959
## [4,] -0.623 1.032
## [5,] 0.614 -0.623
##
## knn.pred Down Up
## Down 43 58
## Up 68 83
## [1] 0.5
##
## knn.pred Down Up
## Down 48 56
## Up 63 85
## [1] 0.5277778
Can we apply cross-validation in Step 2, forgetting about Step 1?
NO!
* This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data and made use of them. This is a form of training and must be included in the validation process.
* It is easy to simulate realistic data with the class labels independent of the features, so that the true test error is 50%, but the CV error estimate that ignores Step 1 is zero!
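The simulation below is a sketch of that wrong-way CV (the sizes and the screening rule are made up, not from the lecture): the labels are pure noise, yet screening the predictors on all the data before cross-validating makes the CV error look far better than the true 50%.

```r
set.seed(1)
n = 50; p = 1000
x = matrix(rnorm(n * p), n, p)
y = rep(c(0, 1), length = n)              # labels independent of x
# Step 1 on ALL the data: keep the 10 predictors most correlated with y
keep = order(abs(cor(x, y)), decreasing = TRUE)[1:10]
# Step 2: cross-validate a classifier built only on the selected predictors
folds = sample(rep(1:5, length = n))
cv.err = sapply(1:5, function(k) {
  fit = glm(y[folds != k] ~ x[folds != k, keep], family = binomial)
  prob = cbind(1, x[folds == k, keep]) %*% coef(fit)
  mean((prob > 0) != y[folds == k])
})
mean(cv.err)  # optimistically low; the honest answer is about 0.5
```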
## Loading required package: boot
## Warning: package 'boot' was built under R version 3.1.3
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
## [1] 17.25330 17.19723
## [1] 17.2533
## Warning in cv.glm(Auto, glm.fit, K = 10): 'K' has been set to 11.000000
## [1] 0.5758321
## [1] 0.5758321
## [1] 0.5963833
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.5758321 -7.315422e-05 0.08861826
## [1] "X1" "X2" "y"
##
## Call:
## lm(formula = y ~ X1 + X2, data = Xy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44171 -0.25468 -0.01736 0.33081 1.45860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26583 0.01988 13.372 < 2e-16 ***
## X1 0.14533 0.02593 5.604 2.71e-08 ***
## X2 0.31337 0.02923 10.722 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5451 on 997 degrees of freedom
## Multiple R-squared: 0.1171, Adjusted R-squared: 0.1154
## F-statistic: 66.14 on 2 and 997 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = y ~ X1 + X2, data = Xy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44171 -0.25468 -0.01736 0.33081 1.45860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26583 0.01988 13.372 < 2e-16 ***
## X1 0.14533 0.02593 5.604 2.71e-08 ***
## X2 0.31337 0.02923 10.722 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5451 on 997 degrees of freedom
## Multiple R-squared: 0.1171, Adjusted R-squared: 0.1154
## F-statistic: 66.14 on 2 and 997 DF, p-value: < 2.2e-16
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Xy, statistic = alpha.fn, R = 2000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.1453263 -0.0002776672 0.02895703
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = new.Xy, statistic = alpha.fn, R = 2000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.1453263 -0.008855075 0.1934213
Best subset and stepwise model selection procedures
Best Subset Selection
1. Let \(M_{0}\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For \(k = 1,2,...,p\):
a. Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors.
b. Pick the best among these \(\binom{p}{k}\) models, and call it \(M_{k}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).
3. Select a single best model among \(M_{0},...,M_{p}\) using cross-validated prediction error, \(C_{p}\) (AIC), BIC, or adjusted \(R^2\).
Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression.
The deviance (negative two times the maximized log-likelihood) plays the role of RSS for a broader class of models.
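As a concrete check (a sketch on the built-in mtcars data, not from the notes), the deviance reported by glm really is minus twice the maximized log-likelihood:

```r
# Fit a logistic regression and verify deviance = -2 * logLik.
glm.fit = glm(am ~ wt + hp, data = mtcars, family = binomial)
deviance(glm.fit)
-2 * as.numeric(logLik(glm.fit))  # matches the deviance above
```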
Forward Stepwise Selection
1. Let \(M_{0}\) denote the null model, which contains no predictors.
2. For \(k = 0,...,p-1\):
2.1 Consider all \(p-k\) models that augment the predictors in \(M_{k}\) with one additional predictor.
2.2 Choose the best among these \(p-k\) models, and call it \(M_{k+1}\). Here best is defined as having smallest RSS or highest \(R^2\).
3. Select a single best model from among \(M_{0},...,M_{p}\) using cross-validated prediction error, \(C_{p}\) (AIC), BIC, or adjusted \(R^2\).
Backward Stepwise Selection
1. Let \(M_{p}\) denote the full model, which contains all \(p\) predictors.
2. For \(k=p,p-1,...,1\):
2.1 Consider all \(k\) models that contain all but one of the predictors in \(M_{k}\), for a total of \(k-1\) predictors.
2.2 Choose the best among these k models, and call it \(M_{k-1}\). Here best is defined as having smallest RSS or highest \(R^2\).
3. Select a single best model from among \(M_{0},...,M_{p}\) using cross-validated prediction error, \(C_{p}\) (AIC), BIC, or adjusted \(R^2\).
\[ BIC = \frac{1}{n}(RSS + log(n)d\hat\sigma^2) \]
* Like \(C_{p}\), the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
* Notice that BIC replaces the \(2d\hat\sigma^2\) used by \(C_{p}\) with a \(log(n)d\hat\sigma^2\) term, where \(n\) is the number of observations.
* Since \(log n > 2\) for any \(n>7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_{p}\).
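One way to see the heavier penalty at work (a sketch on mtcars, not from the lecture): run stepwise selection with R's step(), once with the AIC penalty k = 2 and once with the BIC penalty k = log(n).

```r
# Stepwise selection under AIC and BIC penalties on the built-in mtcars data.
full = lm(mpg ~ ., data = mtcars)
n = nrow(mtcars)
fit.aic = step(full, k = 2, trace = 0)       # AIC penalty
fit.bic = step(full, k = log(n), trace = 0)  # BIC penalty; log(32) > 2
# With the heavier penalty, BIC typically selects no more terms than AIC.
length(coef(fit.aic))
length(coef(fit.bic))
```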
Cross-validation can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance \(\sigma^2\).
One-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. What is the rationale for this?
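This rule is what cv.glmnet's lambda.1se implements; a sketch on simulated data (glmnet is loaded later in these notes and must be installed):

```r
library(glmnet)
set.seed(1)
x = matrix(rnorm(100 * 20), 100, 20)
y = x[, 1] + rnorm(100)       # only the first predictor matters
cv.fit = cv.glmnet(x, y)
cv.fit$lambda.min  # lambda minimizing the CV error curve
cv.fit$lambda.1se  # largest lambda within one SE of that minimum
```

The rationale: models whose CV errors are within one standard error are statistically hard to distinguish, so we take the simplest of them.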
Ridge Regression and Lasso
* The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
* As an alternative, we can fit a model containing all \(p\) predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
* It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
One can show that the lasso and ridge regression coefficient estimates solve the problems
\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \Big( y_{i}-\beta_{0}-\sum_{j=1}^p \beta_{j}x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_{j}| \leq s \]
and
\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \Big( y_{i}-\beta_{0}-\sum_{j=1}^p \beta_{j}x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_{j}^2 \leq s \]
respectively.
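A one-dimensional sketch makes the lasso's exact zeros visible (assuming the orthonormal case, where the lasso solution reduces to soft thresholding; the `soft` helper below is hypothetical, written for illustration):

```r
# Soft thresholding: the lasso's coordinate-wise solution in the
# orthonormal case; ridge shrinkage never reaches exactly zero.
soft = function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
b = seq(-3, 3, by = 1)                 # least squares coefficients
rbind(ols   = b,
      lasso = soft(b, 1.5),            # |b| <= 1.5 becomes exactly 0
      ridge = b / (1 + 1.5))           # shrunk toward 0, never exactly 0
```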
library(ISLR)
summary(Hitters)
## AtBat Hits HmRun Runs
## Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:282.5 1st Qu.: 71.5 1st Qu.: 5.00 1st Qu.: 33.50
## Median :413.0 Median :103.0 Median : 9.00 Median : 52.00
## Mean :403.6 Mean :107.8 Mean :11.62 Mean : 54.75
## 3rd Qu.:526.0 3rd Qu.:141.5 3rd Qu.:18.00 3rd Qu.: 73.00
## Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 30.00 1st Qu.: 23.00 1st Qu.: 4.000 1st Qu.: 842.5
## Median : 47.00 Median : 37.00 Median : 6.000 Median : 1931.0
## Mean : 51.49 Mean : 41.11 Mean : 7.312 Mean : 2657.5
## 3rd Qu.: 71.00 3rd Qu.: 57.00 3rd Qu.:10.000 3rd Qu.: 3890.5
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 2.0 Min. : 3.0
## 1st Qu.: 212.0 1st Qu.: 15.00 1st Qu.: 105.5 1st Qu.: 95.0
## Median : 516.0 Median : 40.00 Median : 250.0 Median : 230.0
## Mean : 722.2 Mean : 69.24 Mean : 361.2 Mean : 330.4
## 3rd Qu.:1054.0 3rd Qu.: 92.50 3rd Qu.: 497.5 3rd Qu.: 424.5
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.0
## CWalks League Division PutOuts Assists
## Min. : 1.0 A:139 E:129 Min. : 0.0 Min. : 0.0
## 1st Qu.: 71.0 N:124 W:134 1st Qu.: 113.5 1st Qu.: 8.0
## Median : 174.0 Median : 224.0 Median : 45.0
## Mean : 260.3 Mean : 290.7 Mean :118.8
## 3rd Qu.: 328.5 3rd Qu.: 322.5 3rd Qu.:192.0
## Max. :1566.0 Max. :1377.0 Max. :492.0
## Errors Salary NewLeague
## Min. : 0.000 Min. : 67.5 A:141
## 1st Qu.: 3.000 1st Qu.: 190.0 N:122
## Median : 7.000 Median : 425.0
## Mean : 8.593 Mean : 535.9
## 3rd Qu.:13.000 3rd Qu.: 750.0
## Max. :32.000 Max. :2460.0
There are some missing values here, so before we proceed we will remove them:
Hitters=na.omit(Hitters)
with(Hitters, sum(is.na(Salary)))
## [1] 0
We will now use the package 'leaps' to evaluate all the best-subset models.
#install.packages("leaps")
library(leaps)
regfit.full = regsubsets(Salary~.,data=Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
It gives by default best subsets up to size 8; let's increase that to 19, i.e. all the variables.
regfit.full=regsubsets(Salary~.,data=Hitters, nvmax=19)
reg.summary = summary(regfit.full)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
plot(reg.summary$cp, xlab="number of Variables", ylab="Cp")
which.min(reg.summary$cp)
## [1] 10
points(10,reg.summary$cp[10],pch=20,col="red")
There is a plot method for the 'regsubsets' object
plot(regfit.full,scale="Cp")
coef(regfit.full,10)
## (Intercept) AtBat Hits Walks CAtBat
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798
## CRuns CRBI CWalks DivisionW PutOuts
## 1.4082490 0.7743122 -0.8308264 -112.3800575 0.2973726
## Assists
## 0.2831680
Here we use the 'regsubsets' function but specify the 'method = "forward"' option:
regfit.fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
plot(regfit.fwd, scale = "Cp")
Let's make a training and validation set, so that we can choose a good subset model.
We will do it using a slightly different approach from what was done in the book.
dim(Hitters)
## [1] 263 20
set.seed(1)
train = sample(seq(263),180,replace=FALSE)
train
## [1] 70 98 150 237 53 232 243 170 161 16 259 45 173 97 192 124 178
## [18] 245 94 190 228 52 158 31 64 92 4 91 205 80 113 140 115 43
## [35] 244 153 181 25 163 93 184 144 174 122 117 251 6 104 241 149 102
## [52] 183 224 242 15 21 66 107 136 83 186 60 211 67 130 210 95 151
## [69] 17 256 207 162 200 239 236 168 249 73 222 177 234 199 203 59 235
## [86] 37 126 22 230 226 42 11 110 214 132 134 77 69 188 100 206 58
## [103] 44 159 101 34 208 75 185 201 261 112 54 65 23 2 106 254 257
## [120] 154 142 71 166 221 105 63 143 29 240 212 167 172 5 84 120 133
## [137] 72 191 248 138 182 74 179 135 87 196 157 119 13 99 263 125 247
## [154] 50 55 20 57 8 30 194 139 238 46 78 88 41 7 33 141 32
## [171] 180 164 213 36 215 79 225 229 198 76
regfit.fwd = regsubsets(Salary~.,data=Hitters[train,], nvmax=19, method = "forward")
Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no predict method for 'regsubsets'.
val.errors=rep(NA,19)
x.test=model.matrix(Salary~., data=Hitters[-train,]) # notice the index!
for(i in 1:19){
coefi = coef(regfit.fwd, id = i)
pred = x.test[,names(coefi)]%*%coefi
val.errors[i]=mean((Hitters$Salary[-train]-pred)^2)
}
plot(sqrt(val.errors), ylab="Root MSE", ylim=c(300,400), pch=19, type="b")
points(sqrt(regfit.fwd$rss[-1]/180), col="blue", pch=19,type="b")
legend("topright", legend=c("Training", "Validation"), col=c("blue","black"),pch=19)
As we expect, the training error goes down monotonically as the model gets bigger, but not so for the validation error.
This was a little tedious - not having a predict method for 'regsubsets'.
So we will write one!
predict.regsubsets=function(object, newdata,id,...){
form=as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi=coef(object,id=id)
mat[,names(coefi)]%*%coefi
}
We will do 10-fold cross-validation. It's really easy!
set.seed(11)
folds = sample(rep(1:10, length=nrow(Hitters)))
folds
## [1] 3 1 4 4 7 7 3 5 5 2 5 2 8 3 3 3 9 2 9 8 10 5 8
## [24] 5 5 5 5 10 10 4 4 7 6 7 7 7 3 4 8 3 6 8 10 4 3 9
## [47] 9 3 4 9 8 7 10 6 10 3 6 9 4 2 8 2 5 6 10 7 2 8 8
## [70] 1 3 6 2 5 8 1 1 2 8 1 10 1 2 3 6 6 5 8 8 10 4 2
## [93] 6 1 7 4 8 3 7 8 7 1 10 1 6 2 9 10 1 7 7 4 7 4 10
## [116] 3 6 10 6 6 9 8 10 6 7 9 6 7 1 10 2 2 5 9 9 6 1 1
## [139] 2 9 4 10 5 3 7 7 10 10 9 3 3 7 3 1 4 6 6 10 4 9 9
## [162] 1 3 6 8 10 8 5 4 5 6 2 9 10 3 7 7 6 6 2 3 2 4 4
## [185] 4 4 8 2 3 5 9 9 10 2 1 3 9 6 7 3 1 9 4 10 10 8 8
## [208] 8 2 5 9 8 10 5 8 2 4 1 4 4 5 5 2 1 9 5 2 9 9 5
## [231] 3 2 1 9 1 7 2 5 8 1 1 7 6 6 4 5 10 5 7 4 8 6 9
## [254] 1 2 5 7 1 3 1 3 1 2
table(folds)
## folds
## 1 2 3 4 5 6 7 8 9 10
## 27 27 27 26 26 26 26 26 26 26
cv.errors = matrix(NA,10,19) # make matrix for our errors
for(k in 1:10){
best.fit=regsubsets(Salary~., data=Hitters[folds!=k,], nvmax=19, method="forward")
for(i in 1:19){
pred = predict(best.fit, Hitters[folds==k,], id=i)
cv.errors[k,i] = mean( (Hitters$Salary[folds==k] - pred)^2)
}
}
rmse.cv = sqrt(apply(cv.errors,2,mean))
plot(rmse.cv,pch=19,type="b")
We will use the package 'glmnet', which does not use the model formula language, so we will set up an x and y.
#install.packages("glmnet", repos = "http://cran.us.r-project.org")
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.1.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.1.3
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:base':
##
## crossprod, tcrossprod
##
## Loading required package: foreach
## Loaded glmnet 2.0-2
x = model.matrix(Salary~., data = Hitters)
y = Hitters$Salary
First we will fit a ridge-regression model. This is achieved by calling glmnet with alpha=0 (see the help file). There is also a cv.glmnet function which will do the cross-validation for us.
fit.ridge = glmnet(x,y,alpha=0) # 0 ridge 1 lasso. between are elastic net
plot(fit.ridge,xvar="lambda", label=TRUE)
cv.ridge=cv.glmnet(x,y,alpha=0)
plot(cv.ridge)
Now we fit a lasso model; for this we use the default alpha=1.
fit.lasso=glmnet(x,y)
plot(fit.lasso, xvar="lambda", label=TRUE)
cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)
coef(cv.lasso)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 127.95694754
## (Intercept) .
## AtBat .
## Hits 1.42342566
## HmRun .
## Runs .
## RBI .
## Walks 1.58214111
## Years .
## CAtBat .
## CHits .
## CHmRun .
## CRuns 0.16027975
## CRBI 0.33667715
## CWalks .
## LeagueN .
## DivisionW -8.06171262
## PutOuts 0.08393604
## Assists .
## Errors .
## NewLeagueN .
Suppose we want to use our earlier train/validation division to select the lambda for the lasso.
This is easy to do
lasso.tr=glmnet(x[train,],y[train])
lasso.tr
##
## Call: glmnet(x = x[train, ], y = y[train])
##
## Df %Dev Lambda
## [1,] 0 0.00000 246.40000
## [2,] 1 0.05013 224.50000
## [3,] 1 0.09175 204.60000
## [4,] 2 0.13840 186.40000
## [5,] 2 0.18000 169.80000
## [6,] 3 0.21570 154.80000
## [7,] 3 0.24710 141.00000
## [8,] 3 0.27320 128.50000
## [9,] 4 0.30010 117.10000
## [10,] 4 0.32360 106.70000
## [11,] 4 0.34310 97.19000
## [12,] 4 0.35920 88.56000
## [13,] 5 0.37360 80.69000
## [14,] 5 0.38900 73.52000
## [15,] 5 0.40190 66.99000
## [16,] 5 0.41260 61.04000
## [17,] 5 0.42140 55.62000
## [18,] 5 0.42880 50.67000
## [19,] 5 0.43490 46.17000
## [20,] 5 0.43990 42.07000
## [21,] 5 0.44410 38.33000
## [22,] 5 0.44760 34.93000
## [23,] 6 0.45140 31.83000
## [24,] 7 0.45480 29.00000
## [25,] 7 0.45770 26.42000
## [26,] 7 0.46010 24.07000
## [27,] 8 0.46220 21.94000
## [28,] 8 0.46380 19.99000
## [29,] 8 0.46520 18.21000
## [30,] 8 0.46630 16.59000
## [31,] 8 0.46730 15.12000
## [32,] 8 0.46810 13.78000
## [33,] 9 0.47110 12.55000
## [34,] 9 0.47380 11.44000
## [35,] 9 0.47620 10.42000
## [36,] 10 0.48050 9.49500
## [37,] 9 0.48450 8.65200
## [38,] 10 0.48770 7.88300
## [39,] 10 0.49360 7.18300
## [40,] 11 0.49890 6.54500
## [41,] 12 0.50450 5.96300
## [42,] 12 0.51010 5.43400
## [43,] 13 0.51470 4.95100
## [44,] 13 0.51850 4.51100
## [45,] 13 0.52170 4.11000
## [46,] 14 0.52440 3.74500
## [47,] 14 0.52670 3.41200
## [48,] 14 0.52870 3.10900
## [49,] 14 0.53030 2.83300
## [50,] 14 0.53160 2.58100
## [51,] 15 0.53280 2.35200
## [52,] 16 0.53420 2.14300
## [53,] 17 0.53580 1.95300
## [54,] 17 0.53760 1.77900
## [55,] 17 0.53890 1.62100
## [56,] 17 0.54000 1.47700
## [57,] 17 0.54090 1.34600
## [58,] 17 0.54160 1.22600
## [59,] 17 0.54220 1.11700
## [60,] 17 0.54280 1.01800
## [61,] 17 0.54320 0.92770
## [62,] 17 0.54360 0.84530
## [63,] 17 0.54380 0.77020
## [64,] 18 0.54410 0.70180
## [65,] 18 0.54430 0.63940
## [66,] 18 0.54450 0.58260
## [67,] 18 0.54470 0.53090
## [68,] 18 0.54490 0.48370
## [69,] 19 0.54510 0.44070
## [70,] 19 0.54520 0.40160
## [71,] 19 0.54530 0.36590
## [72,] 19 0.54540 0.33340
## [73,] 19 0.54550 0.30380
## [74,] 19 0.54560 0.27680
## [75,] 19 0.54570 0.25220
## [76,] 19 0.54570 0.22980
## [77,] 19 0.54580 0.20940
## [78,] 19 0.54580 0.19080
## [79,] 19 0.54590 0.17380
## [80,] 19 0.54590 0.15840
## [81,] 19 0.54590 0.14430
## [82,] 19 0.54590 0.13150
## [83,] 19 0.54600 0.11980
## [84,] 18 0.54600 0.10920
## [85,] 18 0.54600 0.09948
## [86,] 18 0.54600 0.09064
## [87,] 18 0.54600 0.08259
## [88,] 19 0.54600 0.07525
## [89,] 19 0.54600 0.06856
pred=predict(lasso.tr,x[-train,])
dim(pred)
## [1] 83 89
rmse = sqrt(apply((y[-train]-pred)^2,2,mean)) # y is broadcast to the dimension of pred
plot(log(lasso.tr$lambda), rmse, type="b", xlab="Log(lambda)")
lam.best = lasso.tr$lambda[order(rmse)[1]]
lam.best
## [1] 19.98706
coef(lasso.tr, s = lam.best)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 77.8923666
## (Intercept) .
## AtBat .
## Hits 0.1591252
## HmRun .
## Runs .
## RBI 1.7340039
## Walks 3.4657091
## Years .
## CAtBat .
## CHits .
## CHmRun .
## CRuns 0.5386855
## CRBI .
## CWalks .
## LeagueN 30.0493021
## DivisionW -113.8317016
## PutOuts 0.2915409
## Assists .
## Errors .
## NewLeagueN 2.0367518
Moving Beyond Linearity
The truth is never linear!
Or almost never!
But often the linearity assumption is good enough.
When it's not...
* polynomials,
* step functions,
* splines,
* local regression, and
* generalized additive models
offer a lot of flexibility, without losing the ease and interpretability of linear models.
\(y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^2+\beta_{3}x_{i}^3+...+\beta_{d}x_{i}^d+\epsilon_{i}\)
Not really interested in the coefficients; more interested in the fitted function values at any value \(x_{0}\):
\[ \hat f(x_{0})=\hat\beta_{0}+\hat\beta_{1}x_{0}+\hat\beta_{2}x_{0}^2+\hat\beta_{3}x_{0}^3+\hat\beta_{4}x_{0}^4 \]
We either fix the degree \(d\) at some reasonably low value, else use cross-validation to choose \(d\).
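A sketch of choosing the degree by cross-validation, mirroring the boot::cv.glm idiom used earlier in these notes (on the built-in mtcars data rather than the book's Wage data):

```r
# Pick the polynomial degree d by K-fold cross-validation.
library(boot)
set.seed(1)
cv.err = rep(NA, 5)
for (d in 1:5) {
  fit = glm(mpg ~ poly(hp, d), data = mtcars)        # gaussian glm = lm
  cv.err[d] = cv.glm(mtcars, fit, K = 8)$delta[1]    # CV estimate of test MSE
}
cv.err
which.min(cv.err)  # degree with the lowest estimated test error
```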
In R: 'y ~ poly(x, degree = 3)' in the model formula. Another way of creating transformations of a variable is to cut the variable into distinct regions:
\[ C_{1}(X)=I(X<35), C_{2}(X)=I(35\leq X < 50),...,C_{3}(X)=I(X \geq 65) \]
* Easy to work with. Creates a series of dummy variables representing each group.
* Useful way of creating interactions that are easy to interpret. For example, interaction effect of Year and Age:
\[ I(\text{Year}<2005) \cdot \text{Age},\quad I(\text{Year} \geq 2005) \cdot \text{Age} \]
would allow for a different linear function of Age in each Year period.
* In R: 'I(Year < 2005)' or 'cut(Age, c(18,25,40,65,90))'.
* Choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available.
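A sketch of the step-function idea on the built-in mtcars data (the book uses Age from the Wage data; the cutpoints here are made up):

```r
# cut() turns hp into a 3-level factor; lm() then fits one constant
# (the within-bin mean) per interval, via dummy variables.
fit = lm(mpg ~ cut(hp, c(0, 100, 200, 400)), data = mtcars)
coef(fit)  # intercept = mean of the first bin; others are offsets from it
```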
Install and load necessary datasets and functions
library("MASS")
library("ISLR")
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.1.3
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'Auto':
##
## mpg
library("car")
## Warning: package 'car' was built under R version 3.1.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:boot':
##
## logit
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.1.2
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyr")
## Warning: package 'tidyr' was built under R version 3.1.2
##
## Attaching package: 'tidyr'
##
## The following object is masked from 'package:Matrix':
##
## expand
The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).
#View(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
#?Boston
We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic lm() syntax is lm(y∼x,data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.
lm.fit = lm(medv~lstat, data = Boston)
lm.fit
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name (e.g. lm.fit$coefficients), it is safer to use extractor functions like coef() to access them.
names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.
predict(lm.fit, data.frame(lstat=(c(5,10,15))), interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat=(c(5,10,15))), interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
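The plotting chunk itself was not echoed; a sketch of what it presumably looked like:

```r
library(MASS)  # Boston data
lm.fit = lm(medv ~ lstat, data = Boston)
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv")
abline(lm.fit, lwd = 2, col = "red")  # least squares regression line
```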
Next we examine some diagnostic plots, several of which were discussed in Section 3.3.3. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow=c(2,2)) divides the plotting region into a 2 × 2 grid of panels.
par(mfrow=c(2,2))
plot(lm.fit)
Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues() function.
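The code that produced the output below was not echoed; presumably it was something like:

```r
library(MASS)  # Boston data
lm.fit = lm(medv ~ lstat, data = Boston)
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))  # index of the observation with largest leverage
```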
## 375
## 375
In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y∼x1+x2+x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.
lm.fit=lm(medv ~ lstat+age,data=Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
The Boston data set contains 13 predictors, and so it would be cumbersome to have to type all of these in order to perform a regression using all of them. Instead, we can use the following short-hand:
lm.fit=lm(medv ~., data=Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the R², and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIFs are low to moderate for these data. The car package is not part of the base R installation, so the first time you use it you must install it via install.packages("car") and then load it with library(car).
?summary.lm
summary(lm.fit)$r.sq
## [1] 0.7406427
summary(lm.fit)$sigma
## [1] 4.745298
library(car)
vif(lm.fit)
## crim zn indus chas nox rm age dis
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
## rad tax ptratio black lstat
## 7.484496 9.008554 1.799084 1.348521 2.941491
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.
lm.fit1=lm(medv~.-age,data=Boston)
summary(lm.fit1)
##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
Alternatively, the update() function can be used.
#lm.fit1= update(lm.fit , ~. -age)
It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat×age as predictors; it is a shorthand for lstat+age+lstat:age.
summary(lm(medv ~ lstat*age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor X, we can create a predictor X² using I(X^2). The function I() is needed since ^ has a special meaning in a formula object; wrapping as we do allows the standard usage in R, which is to raise X to the power 2. We now perform a regression of medv onto lstat and lstat².
lm.fit2 = lm(medv~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.
lm.fit <- lm(medv ~ lstat, data = Boston)
anova(lm.fit, lm.fit2)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here Model 1 represents the linear submodel containing only one predictor, lstat, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat and lstat². The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat² is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat. Typing the following displays the diagnostic plots for the quadratic fit, in which there is little discernible pattern in the residuals:
par(mfrow = c(2,2))
plot(lm.fit2)
In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:
lm.fit5 = lm(medv ~ poly (lstat, 5), data = Boston)
summary(lm.fit5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.
summary(lm(medv ~ log(rm), data = Boston))
##
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.487 -2.875 -0.104 2.837 39.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.488 5.028 -15.21 <2e-16 ***
## log(rm) 54.055 2.739 19.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
## F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16
We will now examine the Carseats data, which is part of the ISLR library. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.
#fix(Carseats)
#View(Carseats)
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
The Carseats data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location—that is, the space within a store in which the car seat is displayed—at each location. The predictor ShelveLoc takes on three possible values, Bad, Medium, and Good.
Given a qualitative variable such as Shelveloc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.
lm.fit = lm(Sales~.+Income:Advertising+Price:Age, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
The contrasts() function returns the coding that R uses for the dummy variables.
attach(Carseats)
## The following objects are masked from Carseats (position 15):
##
## Advertising, Age, CompPrice, Education, Income, Population,
## Price, Sales, ShelveLoc, Urban, US
contrasts(ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
?contrasts
R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location leads to higher sales than a bad shelving location but lower sales than a good shelving location.
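The dummy coding described above can be illustrated directly with the base R model.matrix() function. The sketch below uses a small hypothetical factor mirroring ShelveLoc's levels, rather than the Carseats data itself:

```r
# Hypothetical three-level factor with the same levels as ShelveLoc
loc <- factor(c("Bad", "Good", "Medium"))
model.matrix(~ loc)   # columns locGood and locMedium are the two dummy variables
```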
**3.6.7 Writing Functions**
As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function, called LoadLibraries(), that loads the ISLR and MASS libraries. Before we have created the function, R returns an error if we try to call it.
LoadLibraries = function(){
library(ISLR)
library(MASS)
print ("The libraries have been loaded.")
}
Now if we type in LoadLibraries, R will tell us what is in the function.
LoadLibraries
## function(){
## library(ISLR)
## library(MASS)
## print ("The libraries have been loaded.")
## }
If we call the function, the libraries are loaded in and the print statement is output.
LoadLibraries()
## [1] "The libraries have been loaded."
3.7 Exercises
CONCEPTUAL
1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model. Answer:insert_here
2. Carefully explain the differences between the KNN classifier and KNN regression methods.
Answer:insert_here
3. Suppose we have a data set with five predictors, \(X_1=GPA\), \(X_2=IQ\), \(X_3=Gender\), \(X_4=Interaction\) between GPA and IQ, and \(X_5=Interaction\) between GPA and Gender.